We will start by checking if the required libraries and files are installed. Then we will load the libraries using library().
if (!requireNamespace("BiocManager", quietly = TRUE)){
install.packages("BiocManager")
}
if (!requireNamespace("GEOmetadb", quietly = TRUE)){
BiocManager::install("GEOmetadb")
}
if (!requireNamespace("biomaRt", quietly = TRUE)){
BiocManager::install("biomaRt")
}
if (!requireNamespace("edgeR", quietly = TRUE)){
BiocManager::install("edgeR")
}
library(BiocManager)
library(edgeR)
## Warning: package 'edgeR' was built under R version 4.1.1
library(GEOmetadb)
## Warning: package 'RSQLite' was built under R version 4.1.2
library(biomaRt)
## Warning: package 'biomaRt' was built under R version 4.1.1
library(knitr)
## Warning: package 'knitr' was built under R version 4.1.2
if (!file.exists('GEOmetadb.sqlite')){
getSQLiteFile()
}
Now with all libraries installed, we will use the following commands to query the GEOmetadb (GEO database) to look for a dataset with a “raw counts” txt-supplmentary file. Note we are specifically querying datasets with submission dates > 2015-01-01, related to cancer and high throughput sequencing data of Homo sapiens.
sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
" gse.submission_date,",
" gse.supplementary_file",
"FROM",
" gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
" JOIN gpl ON gse_gpl.gpl=gpl.gpl",
"WHERE",
" gse.submission_date > '2015-01-01' AND",
" gse.title LIKE '%cancer%' AND",
" gpl.organism LIKE '%Homo sapiens%' AND",
" gpl.technology LIKE '%high-throughput seq%' ",
" ORDER BY gse.submission_date DESC",sep=" ")
con <- dbConnect(SQLite(), 'GEOmetadb.sqlite')
rs <- dbGetQuery(con, sql)
unlist(lapply(rs$supplementary_file,
FUN = function(x){x <- unlist(strsplit(x,";")) ;
x <- x[grep(x,pattern="txt",ignore.case = TRUE)];
tail(unlist(strsplit(x,"/")),n=1)})) [1:30]
## [1] "GSE166697_Urinary_miRNA_counts.txt.gz"
## [2] "GSE166417_gene_expression.txt.gz"
## [3] "GSE165883_2019.01.15.processed.cpm.txt.gz"
## [4] "GSE165842_vst.txt.gz"
## [5] "GSE165452_ACY241_PRL_RNAseq_no_rRNA_Kallisto_Gene_Counts_matrix.txt.gz"
## [6] "GSE165247_RawCounts_Matrix_2019.txt.gz"
## [7] "GSE165115_Transcriptome_counts.txt.gz"
## [8] "GSE165115_Transcriptome_counts.txt.gz"
## [9] "GSE164730_expression_profile.txt.gz"
## [10] "GSE164531_NEDD9_count_table.txt.gz"
## [11] "GSE164169_T47DZOE_T47DPCDH.anno.txt.gz"
## [12] "GSE163366_otu_tax.txt.gz"
## [13] "GSE163374_rat.raw.counts.txt.gz"
## [14] "GSE162737_TE-1_mRNA_transcripts.FPKM.txt.gz"
## [15] "GSE162726_barcode_supplementary.txt.gz"
## [16] "GSE162515_RNAseq_rawCounts.txt.gz"
## [17] "GSE162564_22RV1_CountTable.txt.gz"
## [18] "GSE162353_S100A9_MDSC_RNAseq.txt.gz"
## [19] "GSE162285_gene_raw_counts_matrix.txt.gz"
## [20] "GSE162215_all.fpkm.txt.gz"
## [21] "GSE162104_count.txt.gz"
## [22] "GSE162004_gene_expression.txt.gz"
## [23] "GSE161691_TMM_normalized_reads_count_per_million_LNCaP.txt.gz"
## [24] "GSE161509_all_samples.fragments_per_gene.txt.gz"
## [25] "GSE161502_Raw_Counts.txt.gz"
## [26] "GSE161349_Counts_noadj_exons_condense.txt.gz"
## [27] "GSE161243_compiled_counts.txt.gz"
## [28] "GSE160796_gene_fpkm.txt.gz"
## [29] "GSE160693_Normalized_Gene_Counts_Matrix.txt.gz"
## [30] "GSE160723_read_counts.txt.gz"
The chosen dataset is “GSE164531” which is an RNA-seq dataset using the Illumina Hiseq 2000 platform for identification of NEDD9 regulated genes in VCaP prosatate cancer cells. We download the counts data using the following commands and check the dimensions of the data.
We then load the counts data into R and check the dimensions.
data <- read.delim(fnames[1], header = TRUE, check.names = FALSE)
dim(data)
## [1] 58736 8
Then we quickly check the head of the data (first 10 rows) to explore further.
kable(head(data), format = 'html', caption = "Table 0.1: GSE164531 Dataset")
| EnsemblID | GeneSymbol | shNEDD9_1_S10 | shNEDD9_2_S11 | shNEDD9_3_S12 | shNTC_1_S7 | shNTC_2_S8 | shNTC_3_S9 |
|---|---|---|---|---|---|---|---|
| ENSG00000223972 | DDX11L1 | 0 | 0 | 2 | 0 | 0 | 0 |
| ENSG00000227232 | WASH7P | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000278267 | MIR6859-1 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000243485 | MIR1302-2HG | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000284332 | MIR1302-2 | 0 | 0 | 0 | 0 | 0 | 0 |
| ENSG00000237613 | FAM138A | 0 | 0 | 0 | 0 | 0 | 0 |
The first exploratory analysis is checking for duplicate genes in our data. We run the following commands and find that there are no duplicate genes present in the data. Resulting in no further actions required to deal with duplicate genes.
summarized_gene_counts <- sort(table(data$EnsemblID),decreasing = TRUE)
kable(summarized_gene_counts[which(summarized_gene_counts > 1)[1:10]], format = 'html', caption = "Table 0.2: Duplicate genes and it's frequencies")
| Var1 | Freq |
|---|---|
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
| NA | NA |
Now we check for any missing data in the dataset.
# Check which rows have missing data
which(!complete.cases(data))
## [1] 58736
We see that row 58736 has missing data, we access this row to explore further.
data[58736,]
Seems like this row is just an empty row with no meaningful data. As it contains no data, we can safely remove the row without affecting the data-set (We remove the row as it N/A values may interfere with next steps such as normalization).
data <- data[complete.cases(data),]
dim(data)
## [1] 58735 8
We will define groups for the data according to each biological condition (experiment design) for later steps such as normalization.
samples <- data.frame(lapply(colnames(data)[3:8], FUN=function(x){unlist(strsplit(x, split = "_"))[c(1, 2)]}))
colnames(samples) <- colnames(data)[3:8]
rownames(samples) <- c("condition", "replicate")
samples <- data.frame(t(samples))
samples
We now filter out low counts from the data according to the edgeR protocol. The threshold is set to 3 because edgeR recommends the threshold to be the number of replications which in our data set is 3 (Would be interesting to see what happens if we change our threshold, but we will keep it to 3 for now).
cpms <- edgeR::cpm(data[, 3:8])
rownames(cpms) <- data[, 1]
# Threshold set to 3 as recommended by edgeR protocol
keep <- rowSums(cpms > 1) >= 3
dataFiltered <- data[keep, ]
head(dataFiltered)
We quickly compare the dimensions of the filtered data and original data.
dim(data)
## [1] 58735 8
kable(head(dataFiltered), format = 'html', caption = "Table 0.3: Filtered Data")
| EnsemblID | GeneSymbol | shNEDD9_1_S10 | shNEDD9_2_S11 | shNEDD9_3_S12 | shNTC_1_S7 | shNTC_2_S8 | shNTC_3_S9 | |
|---|---|---|---|---|---|---|---|---|
| 20 | ENSG00000279457 | FO538757 | 6 | 12 | 12 | 7 | 10 | 9 |
| 34 | ENSG00000225630 | MTND2P28 | 264 | 298 | 417 | 365 | 153 | 310 |
| 35 | ENSG00000237973 | MTCO1P12 | 25 | 34 | 36 | 46 | 23 | 56 |
| 39 | ENSG00000248527 | MTATP6P1 | 835 | 1060 | 1291 | 1219 | 525 | 1166 |
| 44 | ENSG00000228327 | AL669831 | 14 | 16 | 22 | 6 | 3 | 7 |
| 51 | ENSG00000228794 | LINC01128 | 248 | 300 | 368 | 287 | 146 | 239 |
We notice that the total number of genes reduced to 14682 from 58735 after filtering out low expression data using the edgeR filtering protocol.
Now we will map the filtered data to HUGO gene symbols using grch38.p13 and the biomaRt package. The following commands creates a ensembl_gene_id and HUGO gene symbol conversion table which we will use to merge with the expression data after normalization.
ensembl <- useEnsembl("ensembl")
ensembl <- useDataset("hsapiens_gene_ensembl",mart=ensembl)
conversionStash <- "data_conversion.rds"
if(file.exists(conversionStash)){
dataIdConversion <- readRDS(conversionStash)
} else {
dataIdConversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
filters = c("ensembl_gene_id"),
values = dataFiltered$EnsemblID,
mart = ensembl)
saveRDS(dataIdConversion, conversionStash)
}
kable(head(dataIdConversion), format = 'html', caption = "Table 0.4: Ensembl Gene Id to hgnc symbol")
| ensembl_gene_id | hgnc_symbol |
|---|---|
| ENSG00000000457 | SCYL3 |
| ENSG00000000460 | C1orf112 |
| ENSG00000001084 | GCLC |
| ENSG00000001167 | NFYA |
| ENSG00000001460 | STPG1 |
| ENSG00000001461 | NIPAL3 |
We will now normalize our data using the TMM method using the following commands.
filteredDataMatrix <- as.matrix(dataFiltered[,3:8])
rownames(filteredDataMatrix) <- dataFiltered$EnsemblID
d = DGEList(counts = filteredDataMatrix, group = samples$condition)
d = calcNormFactors(d)
normalizedCounts <- cpm(d)
Now we will merge the normalizedCounts with our identifiers that we mapped in our previous step. Now we have a dataFrame with mapped ensembl_gene_id, HUGO gene symbols and normalized expression data.
dataFilteredAnnot <- merge(dataIdConversion, normalizedCounts, by.x = 1, by.y = 0, all.y=TRUE)
kable(dataFilteredAnnot[1:5,1:8],type = "html", caption = "Table 0.5: Normalized Data")
| ensembl_gene_id | hgnc_symbol | shNEDD9_1_S10 | shNEDD9_2_S11 | shNEDD9_3_S12 | shNTC_1_S7 | shNTC_2_S8 | shNTC_3_S9 |
|---|---|---|---|---|---|---|---|
| ENSG00000000003 | TSPAN6 | 8.056319 | 10.05134 | 9.355297 | 14.791616 | 13.169145 | 13.448793 |
| ENSG00000000419 | DPM1 | 63.443514 | 59.16293 | 58.143675 | 41.819933 | 35.431272 | 42.222956 |
| ENSG00000000457 | SCYL3 | 35.078557 | 32.82588 | 33.699189 | 20.842732 | 22.889229 | 22.518910 |
| ENSG00000000460 | C1orf112 | 24.168958 | 20.99330 | 21.829027 | 8.068154 | 5.330368 | 8.757354 |
| ENSG00000001036 | FUCA2 | 76.367193 | 69.34149 | 69.510864 | 52.039594 | 62.710216 | 52.387741 |
The density plots for pre-normalization and post-normalization data is plotted for comparison.
pre_data2plot <- log2(cpm(dataFiltered[, 3:8]))
pre_counts_density <- apply(log2(cpm(dataFiltered[, 3:8])), 2, density)
xlim <- 0
ylim <- 0
for (i in 1:length(pre_counts_density)) {
xlim <- range(c(xlim, pre_counts_density[[i]]$x));
ylim <- range(c(ylim, pre_counts_density[[i]]$y))
}
cols <- rainbow(length(pre_counts_density))
ltys <- rep(1, length(pre_counts_density))
#plot the first density plot to initialize the plot
plot(pre_counts_density[[1]], xlim=xlim, ylim=ylim, type="n", ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(pre_counts_density)) lines(pre_counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(pre_data2plot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
Fig. 0.1: Density Plot for Pre-normalization data
post_data2plot <- log2(cpm(dataFilteredAnnot[, 3:8]))
post_counts_density <- apply(log2(cpm(dataFilteredAnnot[, 3:8])), 2, density)
xlim <- 0
ylim <- 0
for (i in 1:length(post_counts_density)) {
xlim <- range(c(xlim, post_counts_density[[i]]$x))
ylim <- range(c(ylim, post_counts_density[[i]]$y))
}
cols <- rainbow(length(post_counts_density))
ltys <- rep(1, length(post_counts_density))
#plot the first density plot to initialize the plot
plot(post_counts_density[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
#plot each line
for (i in 1:length(post_counts_density)) lines(post_counts_density[[i]], col=cols[i], lty=ltys[i])
#create legend
legend("topright", colnames(post_data2plot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
Fig. 0.2: Density Plot for Post-normalization data
Notice no significance difference between pre and post normalization. I believe this is due to the fact that my original data happens to be close to a normal distribution, thus normalization with TMM method did not make any significant changes.
We will now deal with missing identifiers (missing HUGO gene symbols). First, we will check how many missing identifiers are present in our data right now.
ensembl_id_missing_gene <- dataFilteredAnnot$ensembl_gene_id[which((dataFilteredAnnot$hgnc_symbol) == "")]
length(ensembl_id_missing_gene)
## [1] 732
We see that we have 732 missing identifiers, lets check out some of these rows with missing identifiers.
kable(dataFilteredAnnot[which((dataFilteredAnnot$hgnc_symbol == ""))[1:5],1:8], type="html", caption = "Table 0.6: Genes with missing identifiers")
| ensembl_gene_id | hgnc_symbol | shNEDD9_1_S10 | shNEDD9_2_S11 | shNEDD9_3_S12 | shNTC_1_S7 | shNTC_2_S8 | shNTC_3_S9 | |
|---|---|---|---|---|---|---|---|---|
| 3142 | ENSG00000111788 | 4.69952 | 4.834820 | 4.526757 | 1.7481001 | 1.8813065 | 2.1893384 | |
| 8607 | ENSG00000165121 | 1.34272 | 2.544642 | 1.710108 | 2.8238540 | 2.8219597 | 3.1276263 | |
| 9080 | ENSG00000167912 | 2.34976 | 1.017857 | 1.609514 | 0.6723462 | 0.3135511 | 0.9382879 | |
| 9416 | ENSG00000170089 | 12.08448 | 10.178568 | 11.266595 | 6.3200541 | 7.8387770 | 5.9424901 | |
| 9425 | ENSG00000170161 | 1.00704 | 1.017857 | 1.307730 | 0.9412847 | 0.9406532 | 0.1563813 |
As the missing identifiers are only ~5% of the data set, we will keep these rows for now and continue with our analysis. Also, we have knowledge of the rows with missing identifiers which will allow us to remove/modify how we handle these missing identifiers easily later on in our analysis and removing these data at this step does not seem the best practice (would be interesting to see how these genes with missing identifiers result in our final analysis).
Now we present the final data set that has been filtered, normalized and mapped to HUGO gene symbols with 14683 genes (rows).
kable(head(dataFilteredAnnot), format = "html", caption = "Table 0.7: Final Dataset")
| ensembl_gene_id | hgnc_symbol | shNEDD9_1_S10 | shNEDD9_2_S11 | shNEDD9_3_S12 | shNTC_1_S7 | shNTC_2_S8 | shNTC_3_S9 |
|---|---|---|---|---|---|---|---|
| ENSG00000000003 | TSPAN6 | 8.056319 | 10.05134 | 9.355297 | 14.791616 | 13.169145 | 13.448793 |
| ENSG00000000419 | DPM1 | 63.443514 | 59.16293 | 58.143675 | 41.819933 | 35.431272 | 42.222956 |
| ENSG00000000457 | SCYL3 | 35.078557 | 32.82588 | 33.699189 | 20.842732 | 22.889229 | 22.518910 |
| ENSG00000000460 | C1orf112 | 24.168958 | 20.99330 | 21.829027 | 8.068154 | 5.330368 | 8.757354 |
| ENSG00000001036 | FUCA2 | 76.367193 | 69.34149 | 69.510864 | 52.039594 | 62.710216 | 52.387741 |
| ENSG00000001084 | GCLC | 50.519835 | 58.14507 | 59.250216 | 96.414442 | 91.556915 | 94.610697 |
dim(dataFilteredAnnot)
## [1] 14684 8
First we will remove duplicate genes as I have made a mistake in this step for A1. We see that there are two duplicate gene_ids each with a frequency of 2.
summarizedGeneCounts <- sort(table(dataFilteredAnnot$ensembl_gene_id), decreasing = TRUE)
kable(summarizedGeneCounts[which(summarizedGeneCounts > 1)[1:5]], format = 'html', caption = "Table 1: Duplicate Genes and it's Frequencies.")
| Var1 | Freq |
|---|---|
| ENSG00000230417 | 2 |
| ENSG00000254876 | 2 |
| NA | NA |
| NA | NA |
| NA | NA |
We will then remove 1 row for each duplicate ID as it contains exactly the same data.
dataFilteredAnnot <- dataFilteredAnnot[-c(12987, 13672),]
Then we will check the MDS plots to decide factors in our model as I have not included a MDS plot in A1.
The MDS plot colored by condition (red for shNEDD9 and blue for shNTC) show that expression depends heavily on the experiment condition (execpt shNTC_2 which seems to be an outlier).
plotMDS(d, labels = rownames(samples), col = c("red", "blue")[factor(samples$condition)])
legend(-1.5, 0.4, legend=c("shNEDD9", "shNTC"), fill = c("red", "blue"), cex = 0.6)
Fig. 1: MDS plot coloured by experimental condition (shNEDD9 vs shNTC)
As we can see the groups are clustered based on cell type except shNTC_2, we will create a design matrix with samples$condition. Then we will create an expression matrix.
We first design a model with only condition as factor since our MDS plot shows that condition is the most significant factor.
modelDesign <- model.matrix( ~ samples$condition)
kable(modelDesign[1:6,], type="html", caption = "Table 2: Model Design Matrix with only condition as factor")
| (Intercept) | samples$conditionshNTC |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
Then we create an expression matrix and fit our model to get our p-values as shown in the below table.
library(limma)
expressionMatrix <- as.matrix(dataFilteredAnnot[, 3:8])
rownames(expressionMatrix) <- dataFilteredAnnot$ensembl_gene_id
colnames(expressionMatrix) <- colnames(dataFilteredAnnot[3:8])
minimalSet <- ExpressionSet(assayData = expressionMatrix)
fit <- lmFit(minimalSet, modelDesign)
fit2 <- eBayes(fit, trend = TRUE)
topfit <- topTable(fit2, coef = ncol(modelDesign), adjust.method = "BH", number = nrow(expressionMatrix))
outputHits <- merge(dataFilteredAnnot[, 1:2], topfit, by.y= 0, by.x = 1, all.y = TRUE)
kable(outputHits[1:10, 1:8], type = "html", row.names = FALSE, caption = "Table 3: Results from Limma method for differential analysis")
| ensembl_gene_id | hgnc_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B |
|---|---|---|---|---|---|---|---|
| ENSG00000000003 | TSPAN6 | 4.648867 | 11.478751 | 4.873394 | 0.0002057 | 0.0004888 | -0.0602216 |
| ENSG00000000419 | DPM1 | -20.425318 | 50.037379 | -7.595332 | 0.0000017 | 0.0000063 | 2.5506801 |
| ENSG00000000457 | SCYL3 | -11.784252 | 27.975749 | -6.234617 | 0.0000164 | 0.0000492 | 1.4054312 |
| ENSG00000000460 | C1orf112 | -14.945135 | 14.857859 | -11.479847 | 0.0000000 | 0.0000001 | 4.5631682 |
| ENSG00000001036 | FUCA2 | -16.027333 | 63.726184 | -4.986969 | 0.0001650 | 0.0003994 | 0.0749498 |
| ENSG00000001084 | GCLC | 38.222312 | 75.082862 | 13.108581 | 0.0000000 | 0.0000000 | 5.0594782 |
| ENSG00000001167 | NFYA | 12.468577 | 57.495405 | 5.076752 | 0.0001388 | 0.0003415 | 0.1801637 |
| ENSG00000001460 | STPG1 | 1.861078 | 9.591631 | 1.898136 | 0.0771967 | 0.1076563 | -4.0498305 |
| ENSG00000001461 | NIPAL3 | 118.396417 | 168.288342 | 34.909691 | 0.0000000 | 0.0000000 | 6.8356876 |
| ENSG00000001497 | LAS1L | -29.382631 | 76.381169 | -10.735174 | 0.0000000 | 0.0000001 | 4.2829923 |
From the limma method, we see that 10108 genes are significantly deferentially expressed with p-value threshold of 0.05. The threshold of 0.05 was chosen because it is considered the standard (to my knowledge). Eventhough we ended up with more genes passing than expected with the 0.05 p-value threshold, we are keeping the threshold to avoid a form of “p-hacking” (changing the threshold only to reduce the number of genes passing).
length(which(outputHits$P.Value < 0.05))
## [1] 10113
We also see 9728 genes passing correction using the Benjamni - hochberg method with threshold of 0.05. The Benjamni - hochberg method was chosen as I remember it being mentioned as a standard method for RNA seq differential expression analysis from class (correct me if I am wrong here). Also, the threshold of 0.05 was chosen again to be consistant and seems to be the standard threshold in most cases.
length(which(outputHits$adj.P.Val < 0.05))
## [1] 9732
Now we will use the Quasi likelihood of the EdgeR package to calculate p-values as the following:
d <- DGEList(counts=expressionMatrix, group=samples$condition)
d <- estimateDisp(d, modelDesign)
fit <- glmQLFit(d, modelDesign)
r <- glmQLFTest(fit)
qlfOutputHits <- topTags(r, n = nrow(dataFilteredAnnot), adjust.method = "BH")
kable(qlfOutputHits$table[1:10, 1:5], type = "html", row.names = FALSE, caption = "Table 4: Results from Quasi liklihood method for differential analysis")
| logFC | logCPM | F | PValue | FDR |
|---|---|---|---|---|
| 3.945787 | 12.193073 | 26384.356 | 0 | 0 |
| 2.894103 | 10.991419 | 15600.934 | 0 | 0 |
| 3.872776 | 10.003619 | 15287.730 | 0 | 0 |
| -2.398194 | 10.114546 | 9561.174 | 0 | 0 |
| -4.160010 | 8.748915 | 8437.388 | 0 | 0 |
| -2.577343 | 9.624646 | 8436.453 | 0 | 0 |
| 3.218526 | 8.940245 | 8399.667 | 0 | 0 |
| 1.983406 | 10.095593 | 6811.047 | 0 | 0 |
| 3.004369 | 8.754974 | 6793.744 | 0 | 0 |
| 4.897888 | 7.848513 | 6517.317 | 0 | 0 |
From the Quasi likelihood method, we see that 6336 genes are significantly deferentially expressed with p-value threshold of 0.05. Again, we are using a p-value threshold of 0.05 as it seems to be the standard for these differential analysis.
length(which(qlfOutputHits$table$PValue < 0.05))
## [1] 6336
We also see 5354 genes passing correction using the Benjamni - hochberg method with threshold of 0.05. Same reason as the limma method, threshold of 0.05 and Quasi likelihood method were selected as these two were considered “standards” during class.
length(which(qlfOutputHits$table$FDR < 0.05))
## [1] 5354
We will continue our analysis from results of the Quasi Likelihood Method as “Limma guide direct users to use edgeR up to the point of calculating differential expression.” I purposely went through the Limma method earlier to compare results and practice the process of using Limma, but now we will only consider results from the Quasi Likelihood method for the rest of our analysis.
Now we will create a volcano plot to show the amounts of differentially expressed genes. Non-signiciant genes in grey, up regualated genes in red and down regualated genes in blue. The gene of interest, NEDD9, is colored black to highlight.
col <- vector(mode="character", length = nrow(dataFilteredAnnot))
for (i in 1:nrow(qlfOutputHits$table)) {
if (qlfOutputHits$table$logFC[i] < 0 && qlfOutputHits$table$FDR[i] < 0.05) {
col[i] <- "blue"
} else if (qlfOutputHits$table$logFC[i] > 0 && qlfOutputHits$table$FDR[i] < 0.05) {
col[i] <- "red"
} else {
col[i] <- "grey"
}
}
col[which(row.names(qlfOutputHits$table) == "ENSG00000111859")] <- "black"
plot(qlfOutputHits$table$logFC,
-log(qlfOutputHits$table$FDR, base=10),
col = col,
xlab = "logFC",
ylab ="-log(FDR)",
main="Volcano Plot of Differentially Expressed Genes")
legend(-6.9, 68, legend=c("Up Regulated Genes", "Down Regulated genes", "Non-significant", "NEDD9"), fill = c("blue", "red", "grey", "black"), cex = 0.5)
Fig. 2: Volcano Plot of differentially expressed genes with FDR
Note that NEDD9 is in the up-regulated gene because I ran the analysis backwards NEDD9 silenced (shNEDD9 condition) vs Control (shNTC condition) instead of control vs silenced.
We will portray a heatmap of the top hits with the following:
if(!requireNamespace("circlize", quietly = TRUE)) {
install.packages("circlize")
}
if(!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
BiocManager::install("ComplexHeatmap")
}
library(circlize)
library(ComplexHeatmap)
hmMatrix <- dataFilteredAnnot[, 3:ncol(dataFilteredAnnot)]
rownames(hmMatrix) <- dataFilteredAnnot$ensembl_gene_id
topHits <- rownames(qlfOutputHits$table)[qlfOutputHits$table$FDR < 0.05]
hmTopHits <- t(scale(t(hmMatrix[which(rownames(hmMatrix) %in% topHits), ])))
if (min(hmTopHits) == 0){
hmCol = colorRamp2(c( 0, max(hmTopHits)),
c( "white", "red"))
} else {
hmcol = colorRamp2(c(min(hmTopHits), 0, max(hmTopHits)), c("blue", "white", "red"))
}
Heatmap(as.matrix(hmTopHits),
cluster_rows = TRUE, show_row_dend = TRUE,
cluster_columns = FALSE, show_column_dend = FALSE,
col=hmcol, show_column_names = TRUE,
show_row_names = FALSE, show_heatmap_legend = TRUE,
column_title = "Heatmap of Top Hits")
Fig. 3: Heatmap of Top Hits
From the heatmap, we can notice that the conditions do cluster together (shNEDD9 clustered on the left and shNTC clustered on the right). I believe this is due to the experimental design, when NEDD9 is suppressed the significantly differentiated genes will be up/down regulated in a similar way in the same NEDD9 suppressed condition (Same idea for the control, since NEDD9 is not suppressed, the genes up/down regulated will be similar for the control conditions).
We will create a thresholded lists of genes.
upRegGenes <- row.names(qlfOutputHits$table)[which(qlfOutputHits$table$FDR < 0.05 & qlfOutputHits$table$logFC > 0)]
downRegGenes <- row.names(qlfOutputHits$table)[which(qlfOutputHits$table$FDR < 0.05 & qlfOutputHits$table$logFC < 0)]
allGenes <- c(upRegGenes, downRegGenes)
Now we save the list of genes in seperate files (table) for later access.
write.table(x=allGenes,
"./allGenes.txt",sep='\t',
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=upRegGenes,
"./upRegGenes.txt",sep='\t',
row.names = FALSE,col.names = FALSE,quote = FALSE)
write.table(x=downRegGenes,
"./downRegGenes.txt",sep='\t',
row.names = FALSE,col.names = FALSE,quote = FALSE)
For over-representation analysis, we will use the g:Profiler method as it allows access to multiple annotation data sources such as Go biological process, Reactom and etc within the method. There is also a R package named “gprofiler2” which is a convenient way to run g:profiler within R (we will be doing this).
Let’s begin with loading the “gprofiler2” library.
if (!requireNamespace("gprofiler2", quietly = TRUE)){
install.packages("gprofiler2")
}
library(gprofiler2)
For annotation data, we will be using the Go biological process (GO-BP) and Reactome. GO-BP was chosen for the relevant biological processes of the significantly differentiated genes. Reactome was also chosen for relevent biological pathways of the significantly differentiated genes (Reactome and WikiPathways are both databases for biological pathways so I decided to only choose Reactome for this assignment). The combination of GO-BP and Reactome seems to be most reasonable giving annotation data of the biological processes and pathways.
The version for GO-BP:
version <- gprofiler2::get_version_info(organism = "hsapiens")
goBPVersion <- version$sources$`GO-BP`$version
goBPVersion
## NULL
The version for Reactome:
reactomeVersion <- version$sources$REAC$version
reactomeVersion
## [1] "annotations: BioMart\nclasses: 2022-1-3"
From above, we can notice that the used GO-BP version is the 2021-12-15 version. While the Reactome version is the 2022-01-03 version.
We will first run a g:Profiler query for all significantly differentiated genes with the GO-BP and Reactome annotations.
allGenesResults <- gost(
allGenes,
organism = "hsapiens",
ordered_query = FALSE,
multi_query = FALSE,
significant = FALSE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
evcodes = FALSE,
user_threshold = 0.05,
correction_method = c("fdr"),
sources = c("GO-BP", "REAC"),
as_short_link = FALSE
)
Lets check how many genesets are returned with threshold of FDR < 0.05, we can see that there are 14565 gene sets returned from GO-BP and Reactome.
length(allGenesResults$result$term_id)
## [1] 2122
Now, we can get the top GO-BP results for querying all genes:
kable(head(allGenesResults$result[, c("term_name", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 5: Top GO-BP Results for All Genes")
| term_name | term_id | p_value |
|---|---|---|
| Cell Cycle | REAC:R-HSA-1640170 | 0 |
| Translation | REAC:R-HSA-72766 | 0 |
| Cell Cycle, Mitotic | REAC:R-HSA-69278 | 0 |
| Metabolism of RNA | REAC:R-HSA-8953854 | 0 |
| Cellular responses to stress | REAC:R-HSA-2262752 | 0 |
| Cellular responses to stimuli | REAC:R-HSA-8953897 | 0 |
From the GO-BP results of all genes, it seems like the terms are related to cellular metabolic processes (macromolecules, protein and organonitrogen). Since the NEDD9 Gene is a member of the CRK-associated substrates family which play a role as a adhesion docking molecule, the top results of GO-BP (cellular metabolic processes) seems to make sense.
We also can get the top Reactome results for querying all genes:
kable(head(allGenesResults$result[allGenesResults$result$source == "REAC", c("term_name", "source", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 6: Top Reactome Results for All Genes")
| term_name | source | term_id | p_value |
|---|---|---|---|
| Cell Cycle | REAC | REAC:R-HSA-1640170 | 0 |
| Translation | REAC | REAC:R-HSA-72766 | 0 |
| Cell Cycle, Mitotic | REAC | REAC:R-HSA-69278 | 0 |
| Metabolism of RNA | REAC | REAC:R-HSA-8953854 | 0 |
| Cellular responses to stress | REAC | REAC:R-HSA-2262752 | 0 |
| Cellular responses to stimuli | REAC | REAC:R-HSA-8953897 | 0 |
From Entrex Gene Summary of NEDD9, it is known that NEDD9 plays a role in apoptosis and the cell cycle. Thus, the results from Reactome makes a lot of sense with the Cell Cycle term as the top result. I’m guessing the Cellular Response to stress/stimuli term is regarding apoptosis/cell cycle (but I am not too sure).
Now we will run the same query but only with the up-regulated genes.
upRegGenesResults <- gost(
upRegGenes,
organism = "hsapiens",
ordered_query = FALSE,
multi_query = FALSE,
significant = FALSE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
evcodes = FALSE,
user_threshold = 0.05,
correction_method = c("fdr"),
sources = c("GO-BP", "REAC"),
as_short_link = FALSE
)
Lets check how many genesets are returned with threshold of FDR < 0.05, we can see that there are 12087 gene sets returned from GO-BP and Reactome.
length(upRegGenesResults$result$term_id)
## [1] 1853
Now, we can get the top GO-BP results for querying up-regulated genes:
kable(head(upRegGenesResults$result[, c("term_name", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 7: Top GO-BP Results for Up-Regulated Genes")
| term_name | term_id | p_value |
|---|---|---|
| Signaling by Rho GTPases, Miro GTPases and RHOBTB3 | REAC:R-HSA-9716542 | 0 |
| Membrane Trafficking | REAC:R-HSA-199991 | 0 |
| Signaling by Rho GTPases | REAC:R-HSA-194315 | 0 |
| RHO GTPase cycle | REAC:R-HSA-9012999 | 0 |
| Signal Transduction | REAC:R-HSA-162582 | 0 |
| Vesicle-mediated transport | REAC:R-HSA-5653656 | 0 |
The first thing we notice is the top terms are different from the top terms of the “all genes” GO-BP query. The up-regulated genes seem to be more focused on localization and development process.
We also can get the top Reactome results for querying up-regulated genes:
kable(head(upRegGenesResults$result[upRegGenesResults$result$source == "REAC", c("term_name", "source", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 8: Top Reactome Results for Up-Regulated Genes")
| term_name | source | term_id | p_value |
|---|---|---|---|
| Signaling by Rho GTPases, Miro GTPases and RHOBTB3 | REAC | REAC:R-HSA-9716542 | 0 |
| Membrane Trafficking | REAC | REAC:R-HSA-199991 | 0 |
| Signaling by Rho GTPases | REAC | REAC:R-HSA-194315 | 0 |
| RHO GTPase cycle | REAC | REAC:R-HSA-9012999 | 0 |
| Signal Transduction | REAC | REAC:R-HSA-162582 | 0 |
| Vesicle-mediated transport | REAC | REAC:R-HSA-5653656 | 0 |
We also notice here that the top terms are different from the top terms of the “all genes” Reactome query. The up-regualted genes (genes more expressed in control since we did the analysis in reverse like mentioned earlier) seem to be more focused on membrane trafficking / signaling while “all genes” query is focusing more on cell cycle.
Now we will run the same query but only with the down-regulated genes.
downRegGenesResults <- gost(
downRegGenes,
organism = "hsapiens",
ordered_query = FALSE,
multi_query = FALSE,
significant = FALSE,
exclude_iea = FALSE,
measure_underrepresentation = FALSE,
evcodes = FALSE,
user_threshold = 0.05,
correction_method = c("fdr"),
sources = c("GO-BP", "REAC"),
as_short_link = FALSE
)
Lets check how many genesets are returned with threshold of FDR < 0.05, we can see that there are 11299 gene sets returned from GO-BP and Reactome.
length(downRegGenesResults$result$term_id)
## [1] 1717
Now, we can get the top GO-BP results for querying down-regulated genes:
kable(head(downRegGenesResults$result[, c("term_name", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 9: Top GO-BP Results for Down-Regulated Genes")
| term_name | term_id | p_value |
|---|---|---|
| Translation | REAC:R-HSA-72766 | 0 |
| Cell Cycle | REAC:R-HSA-1640170 | 0 |
| Metabolism of RNA | REAC:R-HSA-8953854 | 0 |
| Cell Cycle, Mitotic | REAC:R-HSA-69278 | 0 |
| rRNA processing | REAC:R-HSA-72312 | 0 |
| Cap-dependent Translation Initiation | REAC:R-HSA-72737 | 0 |
Interestingly, the GO-BP results from down-regulated genes are similar to the “all genes” results with both results focusing on cellular macromolecule processes/metabolic processes.
We also can get the top Reactome results for querying down-regulated genes:
kable(head(downRegGenesResults$result[downRegGenesResults$result$source == "REAC", c("term_name", "source", "term_id", "p_value")]), type = "html", row.names = FALSE, caption = "Table 10: Top Reactome Results for Down-Regulated Genes")
| term_name | source | term_id | p_value |
|---|---|---|---|
| Translation | REAC | REAC:R-HSA-72766 | 0 |
| Cell Cycle | REAC | REAC:R-HSA-1640170 | 0 |
| Metabolism of RNA | REAC | REAC:R-HSA-8953854 | 0 |
| Cell Cycle, Mitotic | REAC | REAC:R-HSA-69278 | 0 |
| rRNA processing | REAC | REAC:R-HSA-72312 | 0 |
| Cap-dependent Translation Initiation | REAC | REAC:R-HSA-72737 | 0 |
The Reactome results from down-regulated genes are also similar to the “all genes” results with both results focusing on cell cycle and translation.
As mentioned also in A2, note that NEDD9 is in the up-regulated gene because I ran the analysis backwards NEDD9 silenced (shNEDD9 condition) vs Control (shNTC condition) instead of control vs silenced. This is important for interpretation since the positive enrichment will mean more expressed in the NEDD silenced condition and negative enrichment will mean less expressed in the NEDD silenced condition. Essentially, the results should be interpretated backwards from the conventional Control vs Silenced model.
First, we will create a ranked set of genes according to the GSEA format from our ranked genes list in A2 as we did not create one in A2.
qlfOutputHits$table[, "rank"] <- -log(qlfOutputHits$table$PValue, base = 10) * sign(qlfOutputHits$table$logFC)
qlfOutputHits$table <- merge(dataIdConversion, qlfOutputHits$table, by.x = 1, by.y = 0, all.y=TRUE)
qlfOutputHits$table <- qlfOutputHits$table[order(qlfOutputHits$table$rank, decreasing = TRUE),]
ranked_genes_list <- data.frame(GeneName = qlfOutputHits$table$hgnc_symbol, rank = qlfOutputHits$table[,"rank"])
write.table(x=ranked_genes_list, file="ranked_genes_list.rnk",sep = "\t", row.names = FALSE, quote = FALSE)
kable(head(ranked_genes_list), type = "html", row.names = FALSE, caption = "Table 1: Ranked Genes List for GSEA")
| GeneName | rank |
|---|---|
| REG4 | 69.65442 |
| IGFBP3 | 63.90161 |
| PLA2G2A | 63.67974 |
| WDR7 | 57.13595 |
| THBS1 | 54.84997 |
| ADAMTS1 | 54.82225 |
Now we will download the required geneset from Bader Lab for GSEA input. Note that we will use HUGO gene symbols that we matched in A1 for our gene identifier as GSEA will expect HUGO symbols because our geneset database is annotated in HUGO symbols.
if (!requireNamespace("RCurl", quietly = TRUE)){
install.packages("RCurl")
}
library(RCurl)
gmt_url <- "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
filenames <- getURL(gmt_url)
tc <- textConnection(filenames)
contents <- readLines(tc)
close(tc)
rx <- gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea_.*.)(.gmt)(?=\">)",
contents, perl = TRUE)
gmt_file <- unlist(regmatches(contents, rx))
if (!file.exists(paste("data/", gmt_file, sep = ""))){
dest <- file.path(paste(getwd(), "data", sep = ""), gmt_file)
download.file(paste(gmt_url,gmt_file,sep=""), destfile=dest)
}
For the Non-thresholded Gene set Enrichment Analysis, I will be using the GSEA Preranked method of GSEA software version 4.2.3. I will be running GSEA 4.2.3 on my local laptop (M1 Mac) with the following parameters:
As there were missing HUGO symbols for geneIds (check A1 Mapping Data Section for detail), GSEA gave a warning saying 75 genes were ignored due to N/A values. I decided to move on with the warnings as these genes were only a very small part of the data, but I will take a note of this for our future analysis. Please let me know if there are better methods to handle this.
Table 2 shows a summary of the top gene sets for positive enrichment from the GSEA analysis. From the result, we see the top positive enrichment results are related to exocytosis, extracellular structure organization and regulation of exocytosis.
gsea_pos_results <- read.delim("gsea_report_for_na_pos.tsv", row.names = NULL)
gsea_pos_results <- data.frame(name = gsea_pos_results$NAME, size = gsea_pos_results$SIZE, ES = gsea_pos_results$ES, NES = gsea_pos_results$NES, NOM_p_val = gsea_pos_results$NOM.p.val, FDR_q_val = gsea_pos_results$FDR.q.val, FWER_p_val = gsea_pos_results$FWER.p.val, rank_at_max = gsea_pos_results$RANK.AT.MAX, leading_edge = gsea_pos_results$LEADING.EDGE)
kable(head(gsea_pos_results), type = "html", row.names = FALSE, caption = "Table 2: GSEA Positive Enrichment Results")
| name | size | ES | NES | NOM_p_val | FDR_q_val | FWER_p_val | rank_at_max | leading_edge |
|---|---|---|---|---|---|---|---|---|
| EXOCYTOSIS%GOBP%GO:0006887 | 130 | 0.7220704 | 1.949632 | 0 | 0.0023131 | 0.002 | 2168 | tags=33%, list=16%, signal=39% |
| SENSORY PERCEPTION%REACTOME%R-HSA-9709957.3 | 118 | 0.7260143 | 1.933759 | 0 | 0.0011566 | 0.002 | 1426 | tags=29%, list=10%, signal=32% |
| EXTRACELLULAR STRUCTURE ORGANIZATION%GOBP%GO:0043062 | 111 | 0.7326065 | 1.926554 | 0 | 0.0011503 | 0.003 | 1228 | tags=23%, list=9%, signal=25% |
| EXTRACELLULAR MATRIX ORGANIZATION%GOBP%GO:0030198 | 110 | 0.7328299 | 1.919529 | 0 | 0.0008627 | 0.003 | 1228 | tags=24%, list=9%, signal=26% |
| REGULATION OF EXOCYTOSIS%GOBP%GO:0017157 | 121 | 0.7286935 | 1.918481 | 0 | 0.0006902 | 0.003 | 1733 | tags=26%, list=12%, signal=29% |
| EXTERNAL ENCAPSULATING STRUCTURE ORGANIZATION%GOBP%GO:0045229 | 112 | 0.7301717 | 1.898211 | 0 | 0.0007655 | 0.004 | 1228 | tags=23%, list=9%, signal=25% |
Table 3 shows a summary of the top gene sets for negative enrichment from the GSEA analysis. From the result, we see the top negative enrichment results are related to various translations and translation initiation.
gsea_neg_results <- read.delim("gsea_report_for_na_neg.tsv", row.names = NULL)
gsea_neg_results <- data.frame(name = gsea_neg_results$NAME, size = gsea_neg_results$SIZE, ES = gsea_neg_results$ES, NES = gsea_neg_results$NES, NOM_p_val = gsea_neg_results$NOM.p.val, FDR_q_val = gsea_neg_results$FDR.q.val, FWER_p_val = gsea_neg_results$FWER.p.val, rank_at_max = gsea_neg_results$RANK.AT.MAX, leading_edge = gsea_neg_results$LEADING.EDGE)
kable(head(gsea_neg_results), type = "html", row.names = FALSE, caption = "Table 3: GSEA Negative Enrichment Results")
| name | size | ES | NES | NOM_p_val | FDR_q_val | FWER_p_val | rank_at_max | leading_edge |
|---|---|---|---|---|---|---|---|---|
| CAP-DEPENDENT TRANSLATION INITIATION%REACTOME DATABASE ID RELEASE 79%72737 | 115 | -0.9059194 | -2.385736 | 0 | 0 | 0 | 823 | tags=63%, list=6%, signal=66% |
| EUKARYOTIC TRANSLATION INITIATION%REACTOME DATABASE ID RELEASE 79%72613 | 115 | -0.9059194 | -2.370838 | 0 | 0 | 0 | 823 | tags=63%, list=6%, signal=66% |
| CYTOPLASMIC TRANSLATION%GOBP%GO:0002181 | 120 | -0.9004626 | -2.346311 | 0 | 0 | 0 | 823 | tags=60%, list=6%, signal=63% |
| NONSENSE-MEDIATED DECAY (NMD)%REACTOME%R-HSA-927802.2 | 109 | -0.9041099 | -2.336048 | 0 | 0 | 0 | 651 | tags=55%, list=5%, signal=57% |
| GTP HYDROLYSIS AND JOINING OF THE 60S RIBOSOMAL SUBUNIT%REACTOME%R-HSA-72706.2 | 108 | -0.9082190 | -2.334970 | 0 | 0 | 0 | 970 | tags=68%, list=7%, signal=72% |
| HALLMARK_E2F_TARGETS%MSIGDB_C2%HALLMARK_E2F_TARGETS | 152 | -0.8626322 | -2.331709 | 0 | 0 | 0 | 1311 | tags=72%, list=9%, signal=79% |
For the positive enrichment results, we see the top terms of the thresholded analysis results are related to signaling and Membrane Trafficking; while top terms of the non-thresholded analysis results are related to exocytosis, extracellular structure organization. There is minimal similarity between the two results as the top results are mostly different between the Thresholded and Non-thresholded analysis.
However, the negative enrichment results from the thresholded and non-thresholded analysis are quite similar. Both thresholded and non-thresholded analysis top results are related to translation. The main difference I noticed was that the non-thresholded analysis results were more specific (ex. CAP-dependent translation) while the thresholded analysis result only mentions “translation.”
The comparison doesn’t seem to be straightforward comparison as non-thresholded takes account of all genes (hence non-thresholded) while thresholded analysis only takes genes that pass the threshold. In addition, the 75 genes ignored in GSEA make it seem like a not straightforward comparison as these genes were used in thresholded analysis (as g:profiler used gene ids and not the mapped hugo symbols).
Cytoscape 3.9.1 (Shannon et al. 2003) with the Enrichment Map plugin 3.3.3 (Merico et al. 2010) was used for creating the enrichment map. The following parameters/thresholds were used to creat the enrichment map:
I decided to use these thresholds as it was recommended in the enrichment map manual for moderately conservative thresholds.
The enrichment map before manual layout:
Figure 1. Enrichment Map before Manual Layout
For annotating the network I used the AutoAnnotate plugin (Kucera et al. 2016) to cluster and annotate the network with the following parameters:
I annotated the entire network to not miss out on some nodes, but I could always manually hide some small clusters that I don’t need later on for a clearer image.
The publication ready figure was created with the publication-ready option with all annotations:
Figure 2. Publication Ready Enrichment Map
The network was collapsed into a theme network using the AutoAnnotate Create Summary Network functionality with all nodes/clusters collapsed. Theme network:
Figure 3. Theme Network of all nodes/clusters
Main themes include translation, signaling and mitosis/meiosis cycles. The main themes don’t exactly fit with the model of over-expression NEDD9 causing prostate cancer progression and metastasis. However, themes such as signaling, cell cycles and translation may be part of promoting cancer growth and metastasis. Lastly, I don’t notice any novel pathways/themes present in the theme network.
Do the enrichment results support conclusions or mechanism discussed in the original paper? How do these results differ from the results you got from Assignment #2 thresholded methods.
The enrichment results do not directly support the conclusions or mechanism discussed in the original paper, but do show some relevance to the conclusions. The original paper states that NEDD9 gene is suspected to play a role in tumor growth / metastasis for prostate cancer(Han et al. 2021). From the enrichment analysis, we see results related to translation, cell cycle regulation and signaling pathways that may be relevant to prostate cancer growth and metastasis. However, I believe this is not enough to completely conclude that the enrichment results really support the conclusions discussed in the paper.
Comparing the GSEA results with the thresholded analysis results from A2, there is high similarity between the negative enrichment results and minal similarity between the positive enrichment results. Both thresholded and non-thresholded negative enrichment results show translation as the top results showing similarity. However, the positive enrichment results differ between the non-tresholded and thresholded results as thresholded top results show signalling and membrane trafficking, meanwhile non-thresholded top results show exocytosis and extracellular structure organization.
Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your result?
From doing more research, I found that the mechanism of how NEDD9 regulates metastasis and cancer growth is largely unknown (Shagisultanova et al. 2015) which is the reason why I can’t completely conclude that the enrichment results support the conclusion that NEDD9 promotes metastasis/cell invasion. However, the paper NEDD9/Arf6-dependent endocytic trafficking of matrix metalloproteinase 14 (Shagisultanova et al. 2015), discusses the role of NEDD9 in redistribution of MMP14 to the cell surface (extra-cellular structure organization) and trafficking which support some main results (signalling, membrane trafffiking and extracellular structure organization) from the enrichment analysis.
I will be adding a post analysis to the main network using the drug bank gene set to find potential drugs related to our network. I decided to add drugs for post-analysis to see if any cancer/metastasis related drugs will come out as top hits to support conclusions from the original paper. In addition, I was generally curious if any drug related to the target NEDD9 gene will come up as a top result.
The following gene set and parameters were used for adding the post analysis:
A total of 161 gene sets out of 1972 gene sets passed the Mann-Whitney (One Sided Greater) cutoff of 0.05. The top hits are shown in Figure 4. below with fostamatinib as the top result.
Figure 4. DrugBank Top Passed Gene Set list
For the network, I only added the top hit Fostamatinib in the network for clarity and because most drugs were not very relevent with the mechanism dicussed in the paper: Susceptibility-associated genetic variation in NEDD9 contributes to prostate cancer initiation and progression (Han et al. 2021). The snapshot of the network with added post analysis is shown in Figure 5.
Figure 5. Network with Post Analysis Top Hit Fostamatinib
Taking a more detailed look into Fostamatinib in DrugBank, Fostamatinib is a tyrosine kinase inhibitor used to treat chronic immune thrombocytopenia (Wishart et al. 2018). Although the original paper (Han et al. 2021) discusses the potential role of the NEDD9 gene in prostate cancer and not chronic immune thrombocytopenia, it is interesting to see Fostamatinib as the top result as sources such as sources such as GeneCards (Szklarczyk et al. 2015) show NEDD9’s role in chemokine-induced T cell migration and T cell receptor (TCR)–mediated integrin activation.
The post-analysis did not show support for the mechanism/conclusions made in the original paper, but did support other NEDD9 mechanisms discussed in papers such as Preclinical and clinical studies of the NEDD9 scaffold protein in cancer and other diseases (Szklarczyk et al. 2015). In addition, the post-analysis result aligns with the NEDD9 gene description from GeneCards (Szklarczyk et al. 2015).
Packages/Applications used: Biomanager(Cattley and Arthur 2007), GEOmetadb(GEOquery et al., n.d.), BioMart(Durinck et al. 2005), knitR(Xie 2018), Circlize(Gu et al. 2014), ComplexHeatmaps(Gu, Eils, and Schlesner 2016), Limma(Ritchie et al. 2015), edgeR(Robinson, McCarthy, and Smyth 2010), GSEA(Subramanian et al. 2007), cytoscape(Shannon et al. 2003), RCurl(Lang 2007).